Newspaper NLP

The Natural Language Toolkit (NLTK) is a Python library for handling natural language processing (NLP).

This project evaluates the headlines of the New York newspaper published between January and July 2020 to see how words such as "COVID" and "Virus" gained relevance over these months.

This project covers the following topics:

Libraries and Dataset

Drop the 'section' column from the DataFrame.

Installing NLTK modules

Text Preprocessing and Exploratory Analysis

Splitting the Data Frame by Months

Exploratory Data Analysis (EDA)

I need to analyse and clean all the headlines.

First, I will join all the headlines into one string. Then the string will be split into a list in which each word is a list item.

Tokenization

The split() method of Python strings is the most basic tokenizer; by default it uses white space as the delimiter.
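The two steps described above, joining the headlines and then splitting on white space, can be sketched in plain Python. The sample headlines and the variable names are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample headlines; the real ones come from the dataset.
headlines = [
    "Virus Spreads Beyond China",
    "Markets Fall as Virus Fears Grow",
]

# Join every headline into a single string, separated by spaces.
text = " ".join(headlines)

# split() with no argument splits on any run of white space.
tokens = text.split()

print(tokens[:4])
```

Note that split() leaves punctuation attached to the words, so "Virus," and "Virus" would be counted as different tokens until punctuation is removed.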

Remove Punctuation
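One possible way to strip punctuation is sketched here, assuming the tokens are held in a list of strings. The helper name remove_punctuation is hypothetical, and string.punctuation only covers ASCII characters, so typographic quotes or dashes would need extra handling.

```python
import string

def remove_punctuation(tokens):
    """Strip ASCII punctuation from each token and drop tokens left empty."""
    # Translation table that deletes every character in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    cleaned = (token.translate(table) for token in tokens)
    return [token for token in cleaned if token]

print(remove_punctuation(["Virus,", "fears", "-", "grow!"]))
```

Tokens that consist only of punctuation, such as a lone dash, disappear entirely instead of remaining as empty strings.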

Stopwords Removal
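A minimal sketch of stop word filtering follows. To keep it self-contained it uses a tiny hand-written stop word set, which is an assumption for illustration and not NLTK's actual list; the helper name remove_stopwords is also hypothetical.

```python
# Tiny illustrative stop word set; an assumption, not NLTK's actual list.
# With NLTK installed and its "stopwords" corpus downloaded, one could use
# set(nltk.corpus.stopwords.words("english")) instead.
STOP_WORDS = {"a", "an", "and", "as", "in", "of", "the", "to"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Lowercase the tokens and keep only those not in stop_words."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in stop_words]

print(remove_stopwords(["Markets", "Fall", "as", "the", "Virus", "Spreads"]))
```

Lowercasing before the lookup matters because stop word lists are usually lowercase, while headlines are typically written in title case.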

Let's try to get the frequency distribution of these terms.
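As a rough illustration of a frequency distribution, the sketch below counts tokens with collections.Counter from the standard library. The token list is invented sample data. NLTK's FreqDist is a subclass of Counter, so the same most_common call should work on it as well.

```python
from collections import Counter

# Hypothetical cleaned tokens for one month.
tokens = ["virus", "china", "virus", "markets", "covid", "virus", "covid"]

# Map each distinct token to the number of times it occurs.
freq = Counter(tokens)

# most_common(n) returns the n most frequent (word, count) pairs.
print(freq.most_common(2))
```

Calling most_common(50) on the tokens of a given month yields the kind of top 50 list discussed later.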

February

March

April

May

June

July

All Months

Top 50

Below, we can see the top 50 words mentioned in January 2020.